Fix non-deterministic node pks causing duplicate nodes across runs#17
Open
mscottford wants to merge 11 commits intovladbatushkov:mainfrom
Open
Fix non-deterministic node pks causing duplicate nodes across runs#17mscottford wants to merge 11 commits intovladbatushkov:mainfrom
mscottford wants to merge 11 commits intovladbatushkov:mainfrom
Conversation
- Targets net10.0 (up from net9.0) - Buildalyzer/Workspaces 7.1.0 → 8.0.0 - Microsoft.CodeAnalysis/CSharp 4.13.0 → 5.3.0 (Analyzers pin removed) - Neo4j.Driver 5.27.0 → 6.0.0 - System.CommandLine 2.0.0-beta4 → 2.0.5 (first stable release) - Dockerfile base image updated to dotnet/sdk:10.0-alpine - Neo4j image updated to 2026.03.1 with corrected env var names - docker-compose.yml: removed obsolete version attribute, replaced hardcoded Windows volume path with relative ./SystemUnderTest path - Code updated for breaking API changes in System.CommandLine 2.0 (SetAction, Options.Add, Aliases.Add, Required, AllowMultipleArgumentsPerToken) and Neo4j.Driver 6.0 (IDriver.CloseAsync removed, use await using) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…from analysis When AddToWorkspace(workspace, addProjectReferences: true) is called for a project, Buildalyzer eagerly adds its referenced projects into the workspace. The previous deduplication check silently dropped those projects on their own loop iteration, causing them to receive no CONTAINS triple and no code analysis. Adds a regression test using a two-project fixture solution where ProjectA references ProjectB — ensuring both appear in the analysis context regardless of workspace insertion order. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Stage 1: Build projects in parallel using .AsParallel() in GetAnalysisContext; each p.Build() spawns an isolated MSBuild process so there is no shared state. AdhocWorkspace mutations remain sequential (not thread-safe). - Stage 3: Analyze and insert all projects concurrently using Task.WhenAll with Task.Run so CPU-bound syntax tree walking runs across thread-pool threads. The delete step is extracted to DbManager.DeleteData and run once upfront before the parallel loop to avoid races. - Stage 4: A SemaphoreSlim(4) gates concurrent Neo4j sessions to avoid exhausting the connection pool on large solutions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The workspace population step (AddToWorkspace) is the slowest sequential phase on large solutions and previously emitted no output, causing a silent pause between "Building projects - finished." and "done." Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…l to complete Previously the pipeline had two hard barriers: all projects had to finish building before workspace loading began, and all projects had to finish loading before analysis began. The Build stage now launches all MSBuild design-time builds concurrently via Task.WhenEach. As each build completes it flows immediately into the Load stage (sequential AdhocWorkspace mutation) and then yields a (Project, IAnalyzerResult) pair to the caller. The Analyze and Insert stages pick up each pair as it arrives via IAsyncEnumerable, so analysis of early- finishing projects begins while later projects are still building and loading. The solution-order sort is removed — AddToWorkspace(addProjectReferences:true) handles any insertion order correctly via the transitive-reference path introduced in the previous fix. GetAnalysisContext is updated to async Task<AnalysisContext> to consume the stream; the regression test is updated accordingly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
AddToWorkspace(addProjectReferences: true) calls analyzer.Build() internally for any referenced project not yet in the workspace. When Build and Load were interleaved via Task.WhenEach, in-flight build tasks raced with AddToWorkspace's internal builds for the same project, causing Buildalyzer's environment detection to fail with InvalidOperationException: Could not find build environment. Fix: wait for all builds to complete (Task.WhenAll) before starting the Load stage. Build → Load streaming is lost, but Load → Analyze streaming is retained: each project is yielded as soon as it is loaded, so analysis begins immediately without waiting for all projects to finish loading. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously all projects were built in parallel with no limit, spawning one dotnet process per project simultaneously. A SemaphoreSlim(Environment.ProcessorCount) now gates the Build stage so at most one build per logical core runs at a time. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
string.GetHashCode() is randomized per-process in .NET Core and later (hash randomization was introduced to prevent hash-flooding DoS attacks). Using it as the Neo4j pk meant every run produced a different pk for the same logical node, defeating MERGE and silently creating duplicate nodes instead of updating existing ones. Symptom: projects shared between two solutions (e.g. via a git submodule) appeared as two separate Project nodes in the graph rather than one node with CONTAINS edges from both Solution nodes. Fix: replace GetHashCode() with MD5 in all SetPrimaryKey overrides. MD5 is stable across processes and runtimes and is appropriate here since this is a content-addressing use case, not a security one. Note: existing databases will have stale pks. Re-analyze with --delete to clear and rebuild the graph with deterministic pks. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This was referenced Apr 13, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Duplicate node bug:
string.GetHashCode()is randomized per-process in .NET Core and later. Using it as the Neo4jpkmeant every run produced a different value for the same logical node, defeatingMERGEand silently creating duplicate nodes. The symptom was projects shared between two solutions (e.g. via a git submodule) appearing as two separateProjectnodes withCONTAINSedges from each solution, rather than one shared node with edges from both.Fix: Replaced
GetHashCode()with MD5 in all threeSetPrimaryKeyoverrides (Node,MethodNode,PackageNode). MD5 is deterministic across processes and runtimes; since this is content-addressing rather than security, it's the right tool.Neo4j concurrency: Also bumps the
SemaphoreSlimcap for concurrent Neo4j insert sessions from 4 toEnvironment.ProcessorCount, consistent with the same cap used for the build stage.Test plan
dotnet test— existing regression test should pass--deleteand confirm no duplicate nodes are createdCONTAINSedges from both solutions🤖 Generated with Claude Code